Goal

Using NBA data sets containing player performance metrics and salary information, I want to identify quality players who are currently underpaid. Identifying such players will enable management to run a targeted recruiting campaign and, hopefully, gain the players needed to reach the playoffs!

Methods

I will cluster players based on their performance metrics and then evaluate the characteristics of the cluster groups to distinguish between “high” and “low” performing players. With this knowledge of cluster characteristics, I will then visualize the clusters and use a salary heat map to identify players within the high performing cluster who are lower paid.

Data description

First, I loaded the two required data sets and merged them by player name. Next, I did some preliminary data cleaning by removing special characters and NAs. I then normalized the numeric values, excluding ages and salaries. After this initial cleaning, I investigated how best to cluster the data.
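
The normalization step can be illustrated with a short sketch. The original analysis was done in R and does not say which normalization was used, so the min-max scaling below is an assumption, as are the column names "Player", "Age" and "Salary" and the made-up rows. It is written in plain Python purely for illustration.

```python
def min_max_normalize(rows, skip=("Player", "Age", "Salary")):
    """Rescale every column not listed in `skip` to the range [0, 1].

    Assumption: min-max scaling. The write-up only says the numeric
    values were "normalized", so z-scores would be equally plausible.
    """
    columns = [c for c in rows[0] if c not in skip]
    out = [dict(r) for r in rows]  # copy so the input is left untouched
    for col in columns:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in out:
            # A constant column carries no information; map it to 0.
            r[col] = (r[col] - lo) / span if span else 0.0
    return out


# Made-up rows, for illustration only (not taken from the real data).
players = [
    {"Player": "A", "Age": 24, "MP": 10.0, "PTS": 4.0, "Salary": 1_000_000},
    {"Player": "B", "Age": 30, "MP": 30.0, "PTS": 20.0, "Salary": 9_000_000},
    {"Player": "C", "Age": 27, "MP": 20.0, "PTS": 12.0, "Salary": 5_000_000},
]
normalized = min_max_normalize(players)
for row in normalized:
    print(row)
```

Age and salary pass through unchanged, matching the description above, while every performance column ends up on a common scale so that no single metric dominates the distance calculations used in clustering.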

Best Cluster Number?

For my clusters, I decided to use all the normalized performance data (columns 5 to 29). Since salary is the variable I am trying to understand, it was excluded from the clustering. I also excluded age, as it is not a reflection of how “good” a player is, and in preliminary tests including it resulted in lower explained variance values.

The kmeans() clustering function in R requires a “centers” argument that determines the number of clusters, k, that your data will be divided into. To determine the best k for this data set, I applied the elbow method. This method compares the explained variance (a measure of cluster quality) of models with different k values to find the point of diminishing returns. In other words, the turning point of the elbow curve marks where increasing model complexity no longer yields large gains in quality (that is, a much larger explained variance).

Below is an elbow method curve showing explained variance flattening off as k increases. In my clustering model I decided to use k = 4, since it was the point at which the curve became much flatter, and it yielded an explained variance of 61.7%.
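
To make the elbow method concrete, here is a minimal sketch in plain Python. It is not the R kmeans() call used in the analysis: it swaps in a simple Lloyd's-style k-means with a deterministic farthest-point initialization, and it assumes that explained variance means one minus the within-cluster sum of squares divided by the total sum of squares. The toy points are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, iterations=50):
    """Lloyd's-style k-means; returns one cluster label per point.

    Assumption: deterministic farthest-point initialization, chosen so
    the sketch is reproducible. R's kmeans() uses random starts instead.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def explained_variance(points, labels):
    """1 - within-cluster SS / total SS (assumed definition)."""
    total_ss = sum(sq_dist(p, mean_point(points)) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        center = mean_point(members)
        within_ss += sum(sq_dist(p, center) for p in members)
    return 1 - within_ss / total_ss


# Made-up 2D points forming three loose groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
    (1.0, 8.0), (1.5, 9.0), (2.0, 8.5),
]
elbow = {k: explained_variance(points, kmeans(points, k)) for k in range(1, 6)}
for k, ev in elbow.items():
    print(k, round(ev, 3))
```

Plotting these values against k gives the elbow curve. On this toy data the gain should largely stop after k = 3, because the points were built as three groups; on the real player data the same reasoning led to k = 4.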

Characteristics of Clusters

After clustering my data into 4 distinct groups, I wanted to determine what characteristics the players in each cluster shared. To do so, I created bar charts showing the average values of selected performance metrics for each cluster.
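
The per-cluster averages behind such bar charts can be computed with a simple group-and-average step. The sketch below is plain Python with made-up rows; the "cluster" field name and the helper name cluster_means are hypothetical and only stand in for whatever the original R code used.

```python
from collections import defaultdict


def cluster_means(rows, metrics):
    """Average each metric within each cluster.

    Returns {cluster: {metric: mean}}; these are the bar heights.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        counts[row["cluster"]] += 1
        for metric in metrics:
            sums[row["cluster"]][metric] += row[metric]
    return {
        cluster: {m: sums[cluster][m] / counts[cluster] for m in metrics}
        for cluster in counts
    }


# Made-up rows, for illustration only.
rows = [
    {"cluster": 1, "MP": 10.0, "PTS": 4.0},
    {"cluster": 1, "MP": 14.0, "PTS": 6.0},
    {"cluster": 3, "MP": 34.0, "PTS": 25.0},
    {"cluster": 3, "MP": 30.0, "PTS": 21.0},
]
means = cluster_means(rows, ["MP", "PTS"])
for cluster in sorted(means):
    print(cluster, means[cluster])
```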

Game Statistics

The game statistics show that players in clusters 1 and 2 on average have less playing time and are generally not starters, compared to those in clusters 3 and 4.

Scoring Statistics

The scoring statistics show that cluster 3 is the primary scorer, outperforming all other clusters in each scoring category, most importantly in points scored. Additionally, cluster 2 is the lowest in all 4 categories. Combining the scoring statistics with the game statistics, it seems that cluster 2 represents low performing players.

Defensive Statistics

The defensive statistics give us insight into possible position divisions within the clusters. Cluster 4’s higher defensive scores, combined with its lower scoring numbers and large number of minutes played, could indicate that these players are more defensively oriented. It should be noted that cluster 3 is still very high performing here, reinforcing the idea that these players are likely “stars”.

Salary Statistics

Looking now at how average salary varies by cluster, we can see that cluster 3 has the highest paid individuals. This is another piece of evidence supporting the hypothesis that cluster 3 contains “star” quality players. Therefore, we will want to find players who fall into this cluster 3 performance category but are paid below the cluster average.

Under or Over Paid?

To determine whether a player was under- or overpaid, I subtracted the mean salary of the player’s cluster from the player’s salary, so a negative difference means the player earns less than the cluster average. I then made tables of the three most underpaid players in each cluster.

Cluster 1 underpaid

Player difference
XavierTillmanSr -7283531
EricPaschall -7065550
LuguentzDort -7065550

Cluster 2 underpaid

Player difference
AndreRoberson -4390595
ThonMaker -4234079
BrunoCaboclo -4182178

Cluster 3 underpaid

Player difference
JohnCollins -18176538
ShaiGilgeousAlexander -18172520
CollinSexton -17321960

Cluster 4 underpaid

Player difference
JaeSeanTate -7126196
IsaiahRoby -7054182
NazReid -7054182
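
The salary comparison used for the tables in this section can be sketched as follows. This is a plain Python illustration with made-up players and salaries, not the original R code; the field names "player", "cluster" and "salary" and the helper name most_underpaid are hypothetical.

```python
def most_underpaid(rows, top_n=3):
    """For each cluster, list the players furthest below the cluster's
    mean salary as (player, salary - cluster mean) pairs.

    The most negative differences come first.
    """
    by_cluster = {}
    for row in rows:
        by_cluster.setdefault(row["cluster"], []).append(row)

    result = {}
    for cluster, members in by_cluster.items():
        mean_salary = sum(m["salary"] for m in members) / len(members)
        diffs = [(m["player"], m["salary"] - mean_salary) for m in members]
        diffs.sort(key=lambda pair: pair[1])  # most underpaid first
        result[cluster] = diffs[:top_n]
    return result


# Made-up players and salaries, for illustration only.
rows = [
    {"player": "PlayerA", "cluster": 3, "salary": 2_000_000},
    {"player": "PlayerB", "cluster": 3, "salary": 30_000_000},
    {"player": "PlayerC", "cluster": 3, "salary": 4_000_000},
    {"player": "PlayerD", "cluster": 3, "salary": 24_000_000},
    {"player": "PlayerE", "cluster": 2, "salary": 1_000_000},
    {"player": "PlayerF", "cluster": 2, "salary": 3_000_000},
]
underpaid = most_underpaid(rows)
for cluster in sorted(underpaid):
    print(cluster, underpaid[cluster])
```

Because the difference is taken against the mean of the player's own cluster, a player is only ever compared with others of similar performance, which is what makes a large negative value meaningful.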

Visualizing Clusters and Salary

Below is a graph showing the distribution of players and clusters in the 2D feature space of “PTS vs. MP”. I chose this visualization because it makes cluster 3 (square marker) easy to distinguish: these players are high in both minutes played and points scored. Having isolated cluster 3, the “star” cluster, it is clear that some of its players have lower salaries and are colored blue/dark purple.

Choosing the darkest blue players in the middle of cluster 3, we find three potential players to target: John Collins, Collin Sexton, and Shai Gilgeous-Alexander. These are the same players identified in the cluster 3 underpaid table.

Identifying Players

By clustering based on performance metrics, high performing players (cluster 3) were isolated from average and low performing players (clusters 1, 2, and 4). Then, based on these clusters, underpaid players were identified both in the underpaid tables and visually through the “PTS vs. MP” graph. Based on this analysis, the three players that should be targeted are:

  1. John Collins
  2. Shai Gilgeous-Alexander
  3. Collin Sexton

These players were selected because they are the most underpaid within cluster 3, the “star” player cluster, and may therefore be the most open to leaving their current teams.